1 Introduction

This project develops a model for bike re-balancing on bike-sharing platforms to better serve users, focusing on Indego, the bike-sharing platform in Philadelphia. Bike sharing has become increasingly popular in American cities as a sustainable and efficient transportation mode. Biking is an environmentally friendly alternative that has drawn growing attention in recent years as one of many responses to carbon emissions and global climate change. Because biking has a low Passenger Car Equivalent (PCE) per passenger (that is, it moves people through space far more efficiently than high-PCE modes such as the private car), it also helps reduce congestion and traffic fatalities and improves overall transportation conditions in urban space. Moreover, bike sharing rides the broader trend of the sharing economy, whose importance has grown rapidly in recent years both for the national economy and for the everyday life of citizens. Along with the rise of transportation-sharing platforms such as Uber, ZipCar, and Lime, many cities have introduced bike-sharing programs to advance the agenda of a more sustainable and efficient urban transportation system.

However, bike sharing faces a re-balancing problem: the supply of and demand for bikes are often spatially mismatched because destinations and starting points cluster. The problem is more prominent in systems with docking stations than in dockless systems. For example, during the AM rush hour, when people ride from all over the city to the main job centers, the influx of a large number of bikes to the same destination in a short period can leave docks completely full, with no room to park. Conversely, during the PM rush hour, when people ride back home, concentrated demand at the same place and time can produce a shortage of available bikes, since each dock holds only a limited number. Thus, bike-sharing platforms must re-balance bikes across stations so that users can pick up or park a bike when they need to.

To re-balance bikes, multiple strategies can be used in combination. First, the platform can employ a small fleet of trucks to move bikes from one place to another. This method can address bike shortages at specific locations during rush hours, and it is efficient for large-scale re-balancing when the flow of bikes is largely one-directional: for instance, if commuting-time constraints lead people to bike home but not to work, creating shortages at job centers, trucks can move bikes from residential clusters to job centers every night. Second, the platform can offer incentives, such as coupons, to steer riders toward parking at stations where a shortage is likely to occur later. Incentives can be less effective, and further cost-benefit analysis is required to develop incentive policies. For any re-balancing strategy, the model this project develops helps predict when and where shortages are likely to occur, better guiding the re-balancing process.

2 Data Wrangling

Data for this project are downloaded from Indego's open data portal. Because this project includes demographic data from the 2019 American Community Survey, it uses Indego's fourth-quarter 2019 trip file, specifically weeks 41 to 45. We understand that Indego has expanded its service since 2019, yet we assume that the relationships between ride-share demand and key variables such as weather, time lags, and demographic information persist, so our model generalizes to future data.

ride.PA<-read.csv("/Users/inordia/Desktop/UPenn搞起来/592/MUSA508_Assignment6/508-HW6-main/indego-trips-2019-q4.csv")
ride2 <-
  ride.PA %>% 
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))%>%
  filter(week>=41 & week <= 45)

weather.PA <- 
  riem_measures(station = "PHL", date_start = "2019-10-08", date_end = "2019-11-12")

weather.Panel <-  
  weather.PA %>%
  mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Percipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

grid.arrange(top = "Weather Data - Philadelphia - Oct 8-Nov 12, 2019",
             ggplot(weather.Panel, aes(interval60,Percipitation)) + geom_line() + 
               labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
             ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
               labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
             ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
               labs(title="Temperature", x="Hour", y="Temperature") + plotTheme())

We use weather data from Philadelphia International Airport. Philadelphia’s precipitation, wind speed, and temperature from week 41 to week 45 of 2019 are shown above.

#bus stops
bus_stops <- st_read("https://opendata.arcgis.com/datasets/e09e9f98bdf04eada214d2217f3adbf1_0.geojson")%>%
  dplyr::select(X = Longitude, Y = Latitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform('EPSG:2263')

#picnic sites
picnic<-st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+ppr_picnic_sites&filename=ppr_picnic_sites&format=geojson&skipfields=cartodb_id")%>%
  dplyr::select(geometry) %>%
  na.omit()%>%
  st_transform('EPSG:2263')

#schools
school<-st_read("https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson")%>%
  dplyr::select(geometry) %>%
  na.omit()%>%
  st_transform('EPSG:2263')

#playground
playground<-st_read("https://opendata.arcgis.com/datasets/899c807e205244278b3f39421be8489c_0.geojson")%>%
  dplyr::select(geometry)%>%
  st_transform('EPSG:2263')

  

phl.ride <- ride2 %>%
  na.omit() %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326, agr = "constant")%>%
  st_transform('EPSG:2263')

st_c <- st_coordinates

nn_function <- function(measureFrom, measureTo, k) {
  # convert coordinate data to matrices for FNN::get.knnx
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # distances from each origin point to its k nearest destination points
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}

phl.ride <-
  phl.ride %>%
  mutate(
    bus_stops.nn =
      nn_function(st_c(phl.ride), st_c(bus_stops),3),
    school.nn =
      nn_function(st_c(phl.ride), st_c(school),3),
    picnic.nn =
      nn_function(st_c(phl.ride), st_c(picnic),3),
    playground.nn =
      nn_function(st_c(phl.ride), st_c(playground),3))

phl.ride <-
  phl.ride%>%
  select(start_station, end_station, bus_stops.nn, school.nn, picnic.nn, playground.nn,
         interval60, interval15, week, dotw)

phl.census<- get_acs(geography = "tract", 
                     variables = c("B01003_001", "B19013_001", 
                                   "B02001_002", "B08013_001",
                                   "B08012_001", "B08301_001", 
                                   "B08301_010", "B01002_001"
                              ), 
                     year = 2019, 
                     state = "PA", 
                     geometry = TRUE, 
                     county=c("Philadelphia"),
                     output = "wide")%>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
                Means_of_Transport, Total_Public_Trans,
                Med_Age,
                GEOID, geometry)%>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

dat_census <- st_join(ride2 %>% 
                        filter(is.na(start_lon) == FALSE &
                                 is.na(start_lat) == FALSE &
                                 is.na(end_lon) == FALSE &
                                 is.na(end_lat) == FALSE) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      phl.census %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phl.census %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon  = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

We download the amenity data from Philadelphia’s open data portal. Considering the trip purposes for bike sharing, we include bus stops, picnic sites, schools, and playgrounds, and calculate, for each bike station, the mean distance to the three nearest amenities of each type. We also download 2019 ACS data to incorporate demographic information into our model, computing the percentage of white population, mean commute time, and the percentage of people who commute by public transportation.

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel)%>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, phl.census %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

ride.panel <-
  ride.panel%>%
  left_join(st_drop_geometry(phl.ride) %>% distinct(start_station, .keep_all = TRUE) %>% select(start_station, ends_with(".nn")), by = c("start_station"="start_station"))

#Create time lags (within each station, so lags do not bleed across stations)
ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  group_by(start_station) %>%
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
  ungroup() %>%
  mutate(day = yday(interval60)) 

We then develop time-lag features based on each station’s trip counts in previous time periods. For example, the one-hour lag is a station’s trip count one hour earlier.
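As a quick illustration of the lag features, here is a minimal sketch on toy data (hypothetical station IDs and counts), lagging within each station so values do not bleed across stations:

```r
library(dplyr)

# Toy panel: two stations ("A", "B"), three hourly intervals each
toy <- tibble(
  start_station = c("A", "A", "A", "B", "B", "B"),
  interval60    = rep(1:3, 2),
  Trip_Count    = c(5, 8, 2, 0, 1, 4))

toy.lagged <- toy %>%
  arrange(start_station, interval60) %>%
  group_by(start_station) %>%                       # lag within each station
  mutate(lagHour = dplyr::lag(Trip_Count, 1)) %>%
  ungroup()

toy.lagged$lagHour  # NA 5 8 NA 0 1
```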

All data are organized into a panel that includes every possible combination of time interval and space (bike station). There are 118,440 unique space/time units in our panel.
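The panel construction can be sketched on toy values (hypothetical hours and station IDs): expand.grid pairs every time interval with every station, so hours with no trips still appear as rows.

```r
# Two hourly intervals x three stations = six space/time units
hours <- as.POSIXct(c("2019-10-08 08:00", "2019-10-08 09:00"), tz = "UTC")
stations <- c("Station_A", "Station_B", "Station_C")  # hypothetical IDs

panel <- expand.grid(interval60 = hours, start_station = stations)
nrow(panel)  # 6
```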

3 Exploratory Analysis

3.1 Splitting Data

ride.Train <- filter(ride.panel, week >= 41 & week <= 43)
ride.Test <- filter(ride.panel, week >= 44 & week <= 45)

We split this 5-week panel into a 3-week training set (weeks 41 to 43) and a 2-week test set (weeks 44 to 45), covering all stations.

3.2 Weather and Trips

It is intuitive to assume that bike trips are affected by weather. The first set of plots below shows trip counts as a function of temperature by week. While the strength of the correlation varies, the plots show a general pattern: warmer weather is associated with more bike trips. The second plot shows mean trip counts by whether it rained that day. Days without rain have a higher mean trip count; people are more likely to ride when it is not raining.

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
  geom_point() + geom_smooth(method = "lm", se= FALSE) +
  facet_wrap(~week, ncol=8) + 
  labs(title="Trip Count as a function of Temperature by week",
       x="Temperature", y="Mean Trip Count") +
  plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Percipitation = first(Percipitation)) %>%
  mutate(isPercip = ifelse(Percipitation > 0,"Rain/Snow", "None")) %>%
  na.omit()%>%
  group_by(isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPercip, Mean_Trip_Count)) + geom_bar(stat = "identity") +
  labs(title="Does ridership vary with precipitation?",
       x="Precipitation", y="Mean Trip Count") +
  plotTheme()

3.3 Trip Count Serial Autocorrelation

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 

rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>%
  group_by(Legend, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>% 
  ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
  geom_vline(data = mondays, aes(xintercept = monday)) +
  scale_colour_manual(values = palette2) +
  labs(title="Rideshare trips by week in Philly: October 8th - November 11th, 2019",
       x="Day", y="Trip Count") +
  plotTheme() + theme(panel.grid.major = element_blank())

The plot above shows the variation of rideshare trips by week, from week 41 to week 45. The vertical black lines mark Mondays. All weeks share a similar wave pattern, with peaks and troughs falling on the same days each week. Weeks 44 and 45, which form the test set, appear to have slightly fewer trips. This plot suggests that time will play an important role in predicting trip counts.

The table below shows the correlation between the time lags and trip counts. Overall the correlation is positive: more trips at a station in previous time periods correspond to more trips in later periods. The correlation weakens as the lag lengthens, suggesting that the most recent trip counts are the most helpful for prediction. The high correlation between the 1-day lag and trip counts results from trip patterns that repeat every day.

plotData.lag <-
  as.data.frame(ride.panel)%>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) %>%
  kable(caption = "Rideshare trip count") %>%
  kable_styling("striped", full_width = F)

correlation.lag
Rideshare trip count

Variable      correlation
lagHour              0.44
lag2Hours            0.32
lag3Hours            0.24
lag4Hours            0.18
lag12Hours          -0.03
lag1day              0.42

3.4 Trip Count Spatial Autocorrelation

The maps below show the trip counts for all stations by week. There is a clear spatial clustering of trip counts: stations with higher counts tend to be located in Center City, while stations with lower counts are scattered at the periphery of the Indego service area. More importantly, this clustering pattern persists over time. Thus, there is strong spatial autocorrelation in bike trips.

ggplot()+
  geom_sf(data = phl.census)+
  geom_point(data = dat_census %>%
               group_by(start_station, start_lat, start_lon, week)%>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(~week)+
  labs(title="Sum of rideshare trips by station and week",
       subtitle = "Philadelphia, October 8th - November 11th, 2019")+
  mapTheme()

3.5 Space/Time Correlation

While bike sharing in Philadelphia has both strong spatial and temporal autocorrelation, these two elements are not independent of each other. We then explore the space/time correlation for bike sharing trips.

dat_census %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, start_station, time_of_day) %>%
  tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station",
       subtitle="Philadelphia, October 8th - November 11th, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

The plots above show, for each time period, the distribution of stations by mean hourly trips. Both AM and PM rush hours have more stations with higher trip counts, while most stations see very few trips overnight. In general, only a limited number of stations have a high number of trips regardless of time period. Indeed, the plot below confirms that most stations have low demand while a few have high demand. This reinforces the need for bike re-balancing.

ggplot(dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hr by station",
       subtitle = "Philadelphia, October 8th - November 11th, 2019",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme()

The plot below shows the daily trends in ridership by day of the week. The plot suggests that all weekdays share a very similar wave with the same peaks, while weekends have a distinct pattern.

ggplot(dat_census %>% mutate(hour = hour(start_time)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia",
       subtitle = "October 8th - November 11th, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme()

To better understand the differences between weekdays and weekends, the plot below shows bike trips for each. Weekdays tend to have more trips and display a clearer pattern of peaks and troughs; thus, weekdays may need more re-balancing.

ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia - weekend vs weekday",
       subtitle = "October 8th - November 11th, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme()

The maps below show trip counts for stations by time of day and by weekday versus weekend. On weekdays, while both AM and PM rush hours are peaks, they have different clusters of hot spots that require re-balancing. On weekends, Mid-Day and PM Rush see more trips, suggesting a distinct travel pattern.

ggplot()+
  geom_sf(data = phl.census)+
  geom_point(data = dat_census %>%
               mutate(hour = hour(start_time),
                      weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="rideshare trips by station",
       subtitle = "Philadelphia, October 8th - November 11th, 2019")+
  mapTheme()

3.6 Animated Map

To better show the flow of bikes over one day, we select one day from the sample (a Monday in week 42) and map the trip counts for each station at each 15-minute interval, then animate the maps to illustrate the flow. Consistent with previous findings, bikes move from the periphery of the service area toward Center City in the morning and back in the afternoon. Hot spots are more concentrated in Center City during the PM rush than during the AM rush, when they are more spread out. This suggests that the PM rush might require more re-balancing.

week42 <-
  filter(ride2, week == 42 & dotw == "Mon")

week42.panel <-
  expand.grid(
    interval15 = unique(week42$interval15),
    start_station = unique(ride2$start_station))

station<-ride.panel%>%
  dplyr::select(start_station,start_lon,start_lat)%>%
  dplyr::distinct(start_station,start_lon,start_lat)%>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326, agr = "constant")%>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))

ride.animation.data <-
  mutate(week42, Trip_Counter = 1) %>%
  right_join(week42.panel) %>% 
  group_by(interval15, start_station) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  left_join(station, by=c("start_station"="start_station")) %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                           Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips"))

rideshare_animation <-
  ggplot() +
  geom_sf(data=phl.census, alpha=0.3)+
  geom_point(data = ride.animation.data, aes(x=start_lon, y = start_lat, color = Trips, size=0.5)) +
  scale_color_manual(values = palette5[2:4]) +
  labs(title = "Rideshare pickups for one day in October 2019",
       subtitle = "15 minute intervals: {current_frame}") +
  transition_manual(interval15) +
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  mapTheme()

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

4 Model

4.1 Model Building

Five linear regression models are developed with the following variables:

Regression 1 Time, day of week, weather

Regression 2 Distance to amenities, day of week, weather

Regression 3 Spatial demographics, day of week, weather

Regression 4 Time, distance to amenities, spatial demographics, day of week, weather

Regression 5 Time lags, time, distance to amenities, spatial demographics, day of week, weather

##reg1 focuses on just time, day of the week, and weather
reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed, data=ride.Train)


##reg2 focuses on day of the week, weather, distance to amenities service
reg2 <- lm(Trip_Count ~  dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn, data=ride.Train)


####reg3 focuses on day of the week, weather, space
reg3 <-lm(Trip_Count ~  dotw + Temperature+ Percipitation + Wind_Speed
          +Percent_Taking_Public_Trans+Mean_Commute_Time+Percent_White+Med_Age, data=ride.Train)


##reg4 focuses on time, day of the week, weather, distance to amenities service,space
reg4 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn
           +Percent_Taking_Public_Trans+Mean_Commute_Time+Percent_White+Med_Age, data=ride.Train)


##reg5 focuses on time, day of the week, weather, distance to amenities service, space, time lag features
reg5 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn
           +Percent_Taking_Public_Trans+Mean_Commute_Time+Percent_White+Med_Age
           +lagHour+lag2Hours+lag3Hours+lag4Hours+lag12Hours+lag1day, data=ride.Train)



stargazer(reg1, reg2, reg3, reg4, reg5, type ="html", font.size = "small", single.row = TRUE)
Dependent variable:
Trip_Count
(1) (2) (3) (4) (5)
hour(interval60) 0.033*** (0.001) 0.033*** (0.001) 0.006*** (0.001)
dotw.L 0.150*** (0.014) 0.205*** (0.013) 0.210*** (0.014) 0.154*** (0.014) 0.017 (0.012)
dotw.Q -0.281*** (0.014) -0.248*** (0.014) -0.256*** (0.014) -0.290*** (0.014) -0.163*** (0.012)
dotw.C 0.033** (0.014) 0.009 (0.014) 0.009 (0.014) 0.033** (0.014) 0.050*** (0.012)
dotw4 -0.221*** (0.013) -0.218*** (0.013) -0.225*** (0.014) -0.227*** (0.013) -0.178*** (0.012)
dotw5 0.080*** (0.013) 0.088*** (0.013) 0.091*** (0.014) 0.083*** (0.013) 0.114*** (0.012)
dotw6 0.066*** (0.013) 0.071*** (0.013) 0.074*** (0.014) 0.070*** (0.013) 0.043*** (0.012)
Temperature 0.006*** (0.001) 0.024*** (0.001) 0.024*** (0.001) 0.007*** (0.001) 0.001 (0.001)
Percipitation -0.326*** (0.021) -0.248*** (0.021) -0.253*** (0.021) -0.333*** (0.021) -0.159*** (0.019)
Wind_Speed -0.007*** (0.001) -0.003*** (0.001) -0.003*** (0.001) -0.007*** (0.001) -0.001 (0.001)
bus_stops.nn -0.0004*** (0.00003) -0.0003*** (0.00003) -0.0001*** (0.00003)
school.nn -0.0001*** (0.00001) -0.00001 (0.00001) -0.00000 (0.00001)
picnic.nn 0.0001*** (0.00001) 0.00001* (0.00001) 0.00000 (0.00001)
playground.nn 0.00003*** (0.00001) 0.0001*** (0.00001) 0.00002*** (0.00001)
Percent_Taking_Public_Trans -0.822*** (0.086) -0.310*** (0.098) -0.107 (0.085)
Mean_Commute_Time 0.0002 (0.0002) 0.001*** (0.0002) 0.0003** (0.0001)
Percent_White 0.614*** (0.027) 0.647*** (0.027) 0.231*** (0.024)
Med_Age 0.002*** (0.001) -0.0002 (0.001) -0.00004 (0.001)
lagHour 0.291*** (0.004)
lag2Hours 0.078*** (0.004)
lag3Hours 0.042*** (0.004)
lag4Hours 0.010*** (0.004)
lag12Hours -0.021*** (0.003)
lag1day 0.263*** (0.003)
Constant 0.052 (0.053) -0.692*** (0.052) -0.909*** (0.065) -0.427*** (0.077) -0.004 (0.068)
Observations 71,064 71,064 69,048 69,048 69,024
R2 0.046 0.037 0.049 0.073 0.294
Adjusted R2 0.046 0.037 0.048 0.073 0.294
Residual Std. Error 1.325 (df = 71053) 1.331 (df = 71050) 1.336 (df = 69034) 1.319 (df = 69029) 1.151 (df = 68999)
F Statistic 342.815*** (df = 10; 71053) 210.241*** (df = 13; 71050) 271.044*** (df = 13; 69034) 302.659*** (df = 18; 69029) 1,197.774*** (df = 24; 68999)
Note: *p<0.1; **p<0.05; ***p<0.01

4.2 Predicting

We then create a nested data frame of the test data and predict trip counts for each station using each of the five regression models.

ride.Test$week <-as.character(ride.Test$week)

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week)

ride.Test.weekNest
## # A tibble: 2 × 2
##   week  data                  
##   <chr> <list>                
## 1 44    <tibble [23,688 × 32]>
## 2 45    <tibble [23,688 × 32]>
model_pred <- function(dat, fit){
  predict(fit, newdata = dat)
}

week_predictions <- 
  ride.Test.weekNest %>% 
  mutate(A_Time_weather = map(.x = data, fit = reg1, .f = model_pred),
         B_Amenity_weather = map(.x = data, fit = reg2, .f = model_pred),
         C_Space_weather = map(.x = data, fit = reg3, .f = model_pred),
         D_Time_Amenity_Space_weather = map(.x = data, fit = reg4, .f = model_pred),
         E_TimeLags_Space_Amenity_Time_weather = map(.x = data, fit = reg5, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
## # A tibble: 10 × 8
##    week  data     Regression     Prediction  Observed Absolute_Error   MAE sd_AE
##    <chr> <list>   <chr>          <list>      <list>   <list>         <dbl> <dbl>
##  1 44    <tibble… A_Time_weather <dbl [23,6… <dbl [2… <dbl [23,688]> 0.815 0.909
##  2 45    <tibble… A_Time_weather <dbl [23,6… <dbl [2… <dbl [23,688]> 0.784 0.860
##  3 44    <tibble… B_Amenity_wea… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.823 0.944
##  4 45    <tibble… B_Amenity_wea… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.737 0.921
##  5 44    <tibble… C_Space_weath… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.833 0.933
##  6 45    <tibble… C_Space_weath… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.750 0.922
##  7 44    <tibble… D_Time_Amenit… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.813 0.892
##  8 45    <tibble… D_Time_Amenit… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.784 0.856
##  9 44    <tibble… E_TimeLags_Sp… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.669 0.796
## 10 45    <tibble… E_TimeLags_Sp… <dbl [23,6… <dbl [2… <dbl [23,688]> 0.651 0.759

4.3 Accuracy

The Mean Absolute Error (MAE) is calculated for each model. The plot below shows MAE by model and by week. Model E, generated by regression 5, has a significantly lower MAE than all other models in both test weeks. Thus Model E, which includes time, day of the week, weather, distance to amenities, spatial demographics, and time-lag features, is the most accurate model. Compared with the other models, the time-lag features substantially reduce the MAE.
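For reference, MAE is simply the mean of the absolute differences between observed and predicted counts; a toy computation on made-up values:

```r
observed  <- c(2, 0, 5, 1)      # hypothetical observed trip counts
predicted <- c(1.5, 0.5, 3, 1)  # hypothetical predictions

mae <- mean(abs(observed - predicted))
mae  # (0.5 + 0.5 + 2 + 0) / 4 = 0.75
```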

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme()

To better understand each model’s accuracy, the plots below compare predicted and observed trip counts by model for the two test weeks. The visualizations clearly show that Model E, generated by regression 5, has the best goodness of fit. Yet Model E still underpredicts at peak times.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         from_station_id = map(data, pull, start_station)) %>%
  dplyr::select(interval60, from_station_id, Observed, Prediction, Regression) %>%
  unnest() %>%
  na.omit() %>%
  gather(Variable, Value, -Regression, -interval60, -from_station_id) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
  plotTheme()

The map below shows the distribution of MAE by bike station to illustrate the spatial accuracy of the predictions. Stations in Center City generally have higher MAE, while stations on the periphery of the service area have lower MAE. This suggests that important spatial or space/time factors are still missing from the model, leaving the spatial clustering of high MAE unexplained.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "E_TimeLags_Space_Amenity_Time_weather") %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phl.census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model E")+
  mapTheme()

4.4 Space/Time Error Evaluation

To examine the space/time error in our model, we first plot the observed vs. predicted values below for different time periods of the day, split by weekday and weekend. The red line indicates the prediction trend, while the black line indicates the observed trend. The model underpredicts in every scenario, regardless of time period or weekday/weekend.

We then map the MAE in space by time period of day and weekday/weekend. The maps show that MAE is higher for stations around Center City during the AM Rush and for stations in Center City during the PM Rush, likely due to the underprediction that the previous plots suggest.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "E_TimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start_station, start_lat, start_lon,Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "E_TimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phl.census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1, alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme()

We zoom into the rush hours to look at the errors and their relationship with socio-economic factors. The plots for both the AM and PM Rush suggest that higher error is associated with higher median income and a higher percentage of white population. More spatial factors should be explored in future models to account for these patterns.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  select(interval60, start_station, start_lat, 
         start_lon, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
  filter(Regression == "E_TimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables, AM",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  select(interval60, start_station, start_lat, 
         start_lon, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
  filter(Regression == "E_TimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "PM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables, PM",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

4.5 Cross-Validation

We conduct a spatial LOGO-CV (leave-one-group-out cross-validation) that leaves one station out each time. The Mean Error is roughly normally distributed, while the MAE and the standard deviation of the absolute errors are skewed to the right. This suggests that the generalizability of this model is decent but has room for improvement.

reg.vars <- c("dotw", "Temperature", "Percipitation","Wind_Speed",
              "bus_stops.nn","school.nn","picnic.nn","playground.nn",
              "Percent_Taking_Public_Trans","Mean_Commute_Time","Percent_White","Med_Age",
              "lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, indVariables, dependentVariable)
    
    regression <-
      glm(Trip_Count ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(allPredictions)
}

reg.cv <- crossValidate(
  dataset = ride.panel,
  id = "start_station",
  dependentVariable = "Trip_Count",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = start_station, Trip_Count, Prediction)

reg.summary <- mutate(reg.cv, Error = Prediction - Trip_Count,
                      Regression = "LOGO-CV on the 5 week panel")

error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Error, na.rm = T),
            MAE = mean(abs(Error), na.rm = T),
            SD_MAE = sd(abs(Error), na.rm = T)) %>%
  ungroup()
error_by_reg_and_fold.long <-
  error_by_reg_and_fold %>%
  gather(Variable, Value, -cvID, -Regression)

ggplot(error_by_reg_and_fold.long, aes(x = Value)) +
  geom_histogram(bins = 30, colour = "black", fill = "#FDE725FF") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  plotTheme()
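For clarity, the per-station statistics summarized in the histograms are intended as follows. For each held-out station $s$ with $n_s$ station-hours:

$$\text{ME}_s = \frac{1}{n_s}\sum_{i \in s}\left(\hat{y}_i - y_i\right), \qquad \text{MAE}_s = \frac{1}{n_s}\sum_{i \in s}\left|\hat{y}_i - y_i\right|$$

where $y_i$ and $\hat{y}_i$ are the observed and predicted trip counts, and SD_MAE is the standard deviation of the absolute errors $\left|\hat{y}_i - y_i\right|$ within station $s$.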

4.6 Generalizability by Context

To further test the generalizability of our model, we map the cross-validation errors by race and income context in Philadelphia. Patterns of error are somewhat obscure in both cases. For race, the Indego service area is mainly located in relatively white neighborhoods, and the few stations located in less white neighborhoods generally have higher errors. For income, errors tend to decrease for stations in higher-income neighborhoods and increase for stations in lower-income ones. Since no obvious spatial clustering of errors is prominent by race or income context, our model is largely generalizable.

geo_info <- phl.ride%>%
  select(start_station, geometry)%>%
  distinct(start_station,.keep_all = TRUE)

error_by_reg_and_fold.geo <-
  error_by_reg_and_fold%>%
  left_join(geo_info, by = c("cvID"="start_station"))%>%
  st_as_sf()

ggplot()+
  geom_sf(data = phl.census, aes(fill = Percent_White))+
  geom_point(data=error_by_reg_and_fold.geo,aes(x = unlist(map(geometry, 1)),
                                            y = unlist(map(geometry, 2)),color=Mean_Error), alpha = 0.9, size=2)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Figure 16. Generalizability by race context",
       subtitle = "LOGO cross validation on the 5 week panel")+
  mapTheme()

ggplot()+
    geom_sf(data = phl.census, aes(fill = Med_Inc))+
    geom_point(data=error_by_reg_and_fold.geo,aes(x = unlist(map(geometry, 1)),
                                              y = unlist(map(geometry, 2)),color=Mean_Error), alpha = 0.9, size=2)+
    scale_colour_viridis(direction = -1,
                         discrete = FALSE, option = "D")+
    ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
    xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
    labs(title="Figure 17. Generalizability by income",
         subtitle = "LOGO cross validation on the 5 week panel")+
    mapTheme()

5 Conclusion

In this project, we develop a model to predict bike trips at each bike sharing station over time to guide the re-balancing process and ensure that the demand for and supply of bikes are matched. Among all the models we develop, Model E, which includes time lags, time variables, distance to amenities, spatial demographic variables, day of the week, and weather conditions, performs best, as the time lags are strong predictors of bike trips. Yet, our model still underpredicts during rush hours and at hot spots. Overall, our model has decent goodness of fit and generalizability. Thus, we argue that Indego should use our model to design re-balancing strategies while assuming that the actual demand for bikes will be higher during rush hours. For example, our model indicates that weekends will need less re-balancing than weekdays. Indego can also develop real-time adjustments to incentives that guide the flow of bikes to certain stations based on the rising or falling demand that our model predicts. Further improvements could narrow the prediction time interval to better guide these real-time incentives. In general, our model is a useful tool for Indego to address the re-balancing issue.